---
title: "STA 9750 Final Project: A Tale of 59 NYC Community Districts"
author: "Eduardo Alarcon"
format:
  html:
    code-fold: true
    code-summary: "Show code"
    toc: true
    toc-title: "On this Page"
    toc-location: left
    toc-depth: 3
    df-print: kable
execute:
  message: false
  warning: false
  echo: false
---
## Introduction
### The Overarching and Specific Questions
The COVID-19 pandemic established new urban norms, transforming job proximity from a strict requirement into a flexible option through remote work. This paradigm shift necessitates an analysis of how traditional value determinants adapted to these new dynamics. Our research team’s work addresses the overarching question (OQ):
> Did COVID-19 reshape the relationship between neighborhood characteristics and property values across NYC's CDs?
While the team's broader effort examines crime, density, job accessibility, and transit, this analysis focuses on Educational Attainment (EA). Historically, high-education neighborhoods commanded significant price premiums, driven by well-known agglomeration effects. The pandemic disrupted these patterns, however, motivating this analysis's specific question (SQ):
> Did the strength of the relationship between neighborhood educational attainment and property values change post-COVID, and did this change differ across NYC's Community Districts?
### Hypotheses
Given how the pandemic reshaped work-life demands and redefined dwelling needs, this question motivates two hypotheses.
**Hypothesis 1:** The positive correlation between education and property values would strengthen post-COVID.
**Hypothesis 2:** High-education CDs would experience stronger property value growth post-COVID as remote work freed professionals to prioritize neighborhood amenities over commute times.
Beyond testing these hypotheses, this analysis also investigates whether pandemic-era shifts represent temporary disruption or a systemic realignment of urban real estate economics.
## Data Acquisition and Processing
This analysis integrates four data sources through programmatic acquisition, requiring careful transformation to align incompatible geographic coding systems across federal and city databases.
### Setup and Configuration
Loading the required packages and defining global constants allows the analysis to programmatically assemble a CD-level panel of all 59 districts for 2017–2019 vs. 2021–2023 and merge it with baseline 2019 American Community Survey (ACS) education variables.
```{r}
# Core wrangling / plotting stack (readr and purrr are implied by the
# masking messages in the rendered output; scales provides comma())
library(dplyr)
library(ggplot2)
library(purrr)
library(readr)
library(scales)
library(sf)
library(stringr)
library(tibble)
library(tidycensus)
library(tidyr)
library(utils)

# Reporting / inference
library(broom)  # tidy(), glance()
library(knitr)  # kable tables
library(infer)  # t_test
library(stats)  # descriptive statistics

options(dplyr.summarise.inform = FALSE)

# NYC counties for ACS (Bronx, Brooklyn, Manhattan, Queens, Staten Island)
NYC_COUNTIES <- c("005", "047", "061", "081", "085")

# ACS B15003 variables we'll actually use
ACS_EDUCATION_VARS <- c(
  "B15003_001", # Total population 25+
  "B15003_022", # BA
  "B15003_023", # MA
  "B15003_024", # Professional
  "B15003_025"  # Doctorate
)

# Period definitions for pre/post COVID
PERIOD_DEFINITIONS <- list(
  pre_covid = list(
    acs_end_year = 2019,        # ACS 2015–2019 (baseline)
    sales_years  = 2017:2019,
    label        = "Pre-COVID (2017–2019)"
  ),
  post_covid = list(
    acs_end_year = 2019,        # SAME baseline ACS 2015–2019
    sales_years  = 2021:2023,
    label        = "Post-COVID (2021–2023)"
  )
)

# Labels for period variables used in tables/plots
PERIOD_LABELS <- c(
  pre_covid  = "Pre-COVID",
  post_covid = "Post-COVID"
)

clean_period <- function(x) {
  recode(x, !!!PERIOD_LABELS, .default = as.character(x))
}
```
### NYC Community District Shapefile
The get_nyc_cd() function retrieves the official shapefile from the NYC Department of City Planning, unzips it, and constructs standardized identifiers for all 59 Community Districts.
```{r}
#| label: get-nyc-cd
#| include: false
get_nyc_cd <- function() {
  cd_url   <- "https://s-media.nyc.gov/agencies/dcp/assets/files/zip/data-tools/bytes/community-districts/nycd_25c.zip"
  zip_path <- "data_raw/nycd_25c.zip"
  dir_out  <- "data_raw/nycd_25c"
  rds_path <- "data_clean/nycd_25c.rds"

  # Return cached version if it exists
  if (file.exists(rds_path)) {
    return(readRDS(rds_path))
  }

  # Ensure directories exist
  if (!dir.exists("data_raw")) dir.create("data_raw")
  if (!dir.exists("data_clean")) dir.create("data_clean")

  # Download shapefile zip if needed
  if (!file.exists(zip_path)) {
    download.file(cd_url, destfile = zip_path, mode = "wb")
  }

  # Unzip if needed
  if (!dir.exists(dir_out) || length(list.files(dir_out, recursive = TRUE)) == 0) {
    unzip(zip_path, exdir = dir_out)
  }

  # Find the actual .shp file inside the unzipped directory
  shp_path <- list.files(
    dir_out,
    pattern = "\\.shp$",
    full.names = TRUE,
    recursive = TRUE
  )[1]

  # Read shapefile, enforce CRS, and build CD IDs
  cd_sf <- st_read(shp_path, quiet = TRUE) |>
    st_transform("EPSG:2263") |> # explicit CRS: NAD83 / NY Long Island (ftUS)
    mutate(
      boro_cd  = as.integer(BoroCD), # e.g., 101, 207, 503
      boro_num = substr(sprintf("%03d", boro_cd), 1, 1),
      boro_abbr = case_when(
        boro_num == "1" ~ "MN",
        boro_num == "2" ~ "BX",
        boro_num == "3" ~ "BK",
        boro_num == "4" ~ "QN",
        boro_num == "5" ~ "SI",
        TRUE ~ NA_character_
      ),
      cd_num = sprintf("%02d", boro_cd %% 100), # 01–18
      cd_id  = paste0(boro_abbr, cd_num)        # MN01, BK03, etc.
    ) |>
    # Keep only the 59 standard CDs, drop Joint Interest Areas
    filter(as.integer(cd_num) <= 18)

  # Check for duplicate CD IDs
  dup_ids <- cd_sf |> count(cd_id) |> filter(n > 1)
  if (nrow(dup_ids) > 0) {
    stop("Duplicate cd_id values found: ", paste(dup_ids$cd_id, collapse = ", "))
  }

  # Validate that we have exactly 59 CDs
  if (nrow(cd_sf) != 59) {
    stop("Expected 59 Community Districts but got ", nrow(cd_sf))
  }

  message("Successfully loaded 59 Community Districts")
  saveRDS(cd_sf, rds_path)
  cd_sf
}
```
### PLUTO BBL-to-CD Crosswalk
The get_pluto_cd_crosswalk() function builds a Borough–Block–Lot (BBL) to CD crosswalk from raw NYC Open Data records. This process normalizes approximately 870,000 BBLs, resolving formatting inconsistencies and creating standardized linkages between individual tax lots and CDs.
```{r}
get_pluto_cd_crosswalk <- function() {
  rds_path <- "data_clean/pluto_cd_crosswalk.rds"
  if (file.exists(rds_path)) {
    return(readRDS(rds_path))
  }
  if (!dir.exists("data_clean")) dir.create("data_clean")
  if (!dir.exists("data_raw")) dir.create("data_raw")

  # NYC Open Data API for PLUTO (only bbl and cd columns)
  pluto_url <- paste0(
    "https://data.cityofnewyork.us/resource/64uk-42ks.csv?",
    "$select=bbl,cd&$limit=5000000"
  )

  message("Downloading PLUTO crosswalk from NYC Open Data...")
  pluto_raw <- read_csv(
    pluto_url,
    col_types = cols(
      bbl = col_character(),
      cd  = col_character()
    )
  )
  message(" Original PLUTO rows: ", nrow(pluto_raw))

  pluto_xwalk <- pluto_raw |>
    filter(!is.na(cd), cd != "0", cd != "99") |>
    mutate(
      # Strip trailing ".0", ".00", etc. if present
      bbl = str_replace(bbl, "\\.0+$", ""),
      bbl_length = nchar(bbl),
      boro_cd = as.integer(cd)
    ) |>
    # Keep only clean, 10-character BBLs
    filter(bbl_length == 10) |>
    select(bbl, boro_cd)

  message(" After filtering invalid cd / BBL length: ", nrow(pluto_xwalk))
  message(" Removed rows with invalid cd or bbl_length != 10")

  # Diagnostic: check BBL format consistency
  bbl_lengths <- table(nchar(pluto_xwalk$bbl))
  if (length(bbl_lengths) > 1 || names(bbl_lengths)[1] != "10") {
    warning("BBL length inconsistency detected. All BBLs should be 10 characters.")
    print(bbl_lengths)
  }

  # Lookup table from CD shapefile
  cd_lookup <- get_nyc_cd() |>
    st_drop_geometry() |>
    select(boro_cd, cd_id) |>
    distinct()

  crosswalk <- pluto_xwalk |>
    left_join(cd_lookup, by = "boro_cd") |>
    filter(!is.na(cd_id)) |>
    distinct(bbl, cd_id, boro_cd)

  message(" Crosswalk built with ", nrow(crosswalk), " rows.")
  message(" CDs represented: ", length(unique(crosswalk$cd_id)))

  saveRDS(crosswalk, rds_path)
  crosswalk
}
```
This process removed 1,154 invalid entries (0.1%), preventing silent join failures downstream, as shown in Table 1 below.
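The cleaning rule behind this step hinges on one regular expression: stripping a trailing ".0" that some PLUTO exports append to BBLs, then keeping only clean 10-character identifiers. A minimal base-R sketch of the same rule (the sample BBLs below are invented for illustration, and base `sub()` stands in for the `str_replace()` call used in the function above):

```r
# Hypothetical raw BBL strings as they sometimes arrive from the API
raw_bbl <- c("1000470001.0", "3012340056", "2009870012.00", "401234567")

# Strip a trailing ".0", ".00", etc.
clean_bbl <- sub("\\.0+$", "", raw_bbl)

# Keep only clean, 10-character BBLs; the 9-character entry is dropped
valid <- clean_bbl[nchar(clean_bbl) == 10]
```

The same length filter is what silently discards the malformed entries counted above, rather than letting them fail to join later.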
The get_dof_sales_year_boro() function automates the collection and cleaning of Annualized Sales reports from the NYC Department of Finance (DOF). Quality filters then drop transactions under $10,000 and restrict the data to residential tax classes (1, 2, 2A, 2B, and 2C).
A two-stage join strategy combines exact BBL matching with a block-level fallback. Because standard exact-match joins lose approximately 25% of transactions to condo billing BBLs, the fallback joins each remaining unmatched sales record to its city block. This yields a 100% match rate, minimizing inaccuracies in high-density areas, as shown in Table 2.
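The two-stage logic can be illustrated on toy data (all BBLs, prices, and CD IDs below are invented): an exact join first, then a block-level fallback for the leftover condo-style BBL. Note the real function resolves ambiguous blocks by majority vote across lots; this sketch assumes each block maps to a single CD.

```r
# Toy crosswalk: lot-level BBL -> CD
# (10-char BBL = 1 borough digit + 5-digit block + 4-digit lot)
xwalk <- data.frame(
  bbl   = c("1000150001", "1000150002", "3001230001"),
  cd_id = c("MN01", "MN01", "BK02")
)

# Toy sales: the condo-style billing BBL "1000157501" has no lot-level match
sales <- data.frame(
  bbl        = c("1000150001", "1000157501"),
  sale_price = c(1e6, 2e6)
)

# Stage 1: exact BBL match
exact <- merge(sales, xwalk, by = "bbl")

# Stage 2: for the leftovers, fall back to borough digit + 5-digit block
leftover <- sales[!(sales$bbl %in% xwalk$bbl), ]
leftover$block <- substr(leftover$bbl, 1, 6)  # boro digit + block
block_lookup <- unique(data.frame(
  block = substr(xwalk$bbl, 1, 6),
  cd_id = xwalk$cd_id
))
fallback <- merge(leftover, block_lookup, by = "block")[, c("bbl", "sale_price", "cd_id")]

# Both sales now carry a CD assignment
matched <- rbind(exact, fallback)
```

Here the condo billing BBL shares block "00015" with the lot-level BBLs, so the fallback recovers its CD ("MN01") even though no exact match exists.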
```{r}
#| label: define-cd-sales-panel
#| include: false
# Build CD-Level Sales Panel from DOF + PLUTO
build_cd_sales_panel <- function(pre_years = c(2017, 2018, 2019),
                                 post_years = c(2021, 2022, 2023),
                                 boroughs = c("bronx", "brooklyn", "manhattan",
                                              "queens", "staten_island"),
                                 pluto_xwalk = NULL) {
  # 0. Dependencies / inputs
  if (is.null(pluto_xwalk)) {
    pluto_xwalk <- get_pluto_cd_crosswalk()
  }

  # 1. Year/borough grid
  combos <- expand_grid(
    year = c(pre_years, post_years),
    borough = boroughs
  ) |>
    arrange(year, borough)

  message("=== Building CD sales panel ===")
  message("Pre-COVID years: ", paste(pre_years, collapse = ", "))
  message("Post-COVID years: ", paste(post_years, collapse = ", "))

  sales_list <- vector("list", nrow(combos))
  match_summary_vec <- vector("list", nrow(combos))

  # 2. Loop over year × borough combos
  for (i in seq_len(nrow(combos))) {
    yr <- combos$year[i]
    bo <- combos$borough[i]
    message(" [", i, "/", nrow(combos), "] Processing year ", yr, " – ", bo)

    sales <- get_dof_sales_year_boro(yr, bo)
    if (is.null(sales) || nrow(sales) == 0L) {
      warning(" No usable rows for ", yr, " ", bo)
      match_summary_vec[[i]] <- tibble(
        year = yr,
        borough = bo,
        n_total = 0L,
        n_matched = 0L,
        match_rate = NA_real_
      )
      next
    }

    # 2a. Match stats vs. PLUTO crosswalk (per-file diagnostic)
    n_total <- nrow(sales)
    matched <- inner_join(sales, pluto_xwalk, by = "bbl")
    n_matched <- nrow(matched)
    match_rate <- if (n_total > 0L) n_matched / n_total else NA_real_
    message(sprintf(
      "   n_total = %s, n_matched = %s (%.1f%%)",
      comma(n_total), comma(n_matched), 100 * match_rate
    ))

    sales_list[[i]] <- sales |>
      mutate(year = yr, borough = bo)
    match_summary_vec[[i]] <- tibble(
      year = yr,
      borough = bo,
      n_total = n_total,
      n_matched = n_matched,
      match_rate = match_rate
    )
  }

  # 3. Bind all years/boroughs together
  all_sales <- bind_rows(sales_list)
  match_summary_all <- bind_rows(match_summary_vec)

  message("")
  message("=== Per-File Match Summary ===")
  # print(match_summary_all |> arrange(year, borough), n = Inf)
  assign("sales_match_summary", match_summary_all, envir = .GlobalEnv)

  # 4. Collapse to CD–year panel with enhanced matching
  # 4a. Exact BBL match
  sales_matched_exact <- all_sales |> inner_join(pluto_xwalk, by = "bbl")
  n_exact <- nrow(sales_matched_exact)

  # 4b. For unmatched, try block-level matching (for condos)
  sales_unmatched <- all_sales |> anti_join(pluto_xwalk, by = "bbl")
  if (nrow(sales_unmatched) > 0) {
    message("")
    message("=== Attempting Block-Level Matching ===")
    message("Unmatched sales: ", comma(nrow(sales_unmatched)))

    # Create block-level lookup from PLUTO.
    # For each block, find the most common CD (handles edge cases).
    pluto_block_lookup <- pluto_xwalk |>
      mutate(
        block = as.integer(substr(bbl, 2, 6)),
        boro_digit = substr(bbl, 1, 1)
      ) |>
      count(boro_digit, block, cd_id, boro_cd) |>
      group_by(boro_digit, block) |>
      slice_max(n, n = 1, with_ties = FALSE) |>
      ungroup() |>
      select(boro_digit, block, cd_id, boro_cd)

    # Match unmatched sales by block
    sales_matched_block <- sales_unmatched |>
      mutate(boro_digit = substr(bbl, 1, 1)) |>
      inner_join(pluto_block_lookup, by = c("boro_digit", "block")) |>
      select(-boro_digit)

    n_block <- nrow(sales_matched_block)
    message("Block-matched: ", comma(n_block))
    message("Still unmatched: ", comma(nrow(sales_unmatched) - n_block))

    # Combine exact and block matches
    sales_with_cd <- bind_rows(sales_matched_exact, sales_matched_block)
  } else {
    sales_with_cd <- sales_matched_exact
    n_block <- 0
  }

  # 4c. Aggregate to CD-year level
  cd_panel <- sales_with_cd |>
    mutate(
      period = case_when(
        year %in% pre_years ~ "pre_covid",
        year %in% post_years ~ "post_covid",
        TRUE ~ NA_character_
      )
    ) |>
    filter(!is.na(period)) |>
    group_by(cd_id, boro_cd, year, period) |>
    summarise(
      n_sales = n(),
      median_price = median(sale_price, na.rm = TRUE),
      mean_price = mean(sale_price, na.rm = TRUE),
      sd_price = sd(sale_price, na.rm = TRUE),
      total_sales_vol = sum(sale_price, na.rm = TRUE),
      median_gsf = median(gross_square_feet, na.rm = TRUE),
      n_gsf_nonmissing = sum(!is.na(gross_square_feet)),
      .groups = "drop"
    ) |>
    arrange(year, cd_id)

  # 5. Overall match statistics
  total_sales_all_files <- nrow(all_sales)
  total_sales_matched <- n_exact + n_block
  overall_match_rate <- total_sales_matched / total_sales_all_files

  message("")
  message("=== Overall Match Statistics ===")
  message("Total sales (all files): ", comma(total_sales_all_files))
  message("Exact BBL matches: ", comma(n_exact))
  message("Block-level matches: ", comma(n_block))
  message("Sales matched to CDs: ", comma(total_sales_matched))
  message("Sales NOT matched: ",
          comma(total_sales_all_files - total_sales_matched))
  message("Overall match rate: ", round(100 * overall_match_rate, 1), "%")

  if (overall_match_rate < 0.75) {
    warning("Overall match rate is below 75% - review BBL construction logic")
  }

  # 6. Validate CD coverage
  cd_sf <- get_nyc_cd()
  expected_cds <- unique(cd_sf$cd_id)
  actual_cds <- unique(cd_panel$cd_id)
  missing_cds <- setdiff(expected_cds, actual_cds)
  if (length(missing_cds) > 0) {
    warning("Missing CDs in final panel: ", paste(missing_cds, collapse = ", "))
    message(" These CDs may have no matching sales in the selected years")
  }

  # 7. Flag low-volume CD-years
  low_volume <- cd_panel |> filter(n_sales < 50)
  if (nrow(low_volume) > 0) {
    message("")
    message("Warning: ", nrow(low_volume),
            " CD-year combinations have fewer than 50 sales:")
    print(
      low_volume |>
        select(cd_id, year, period, n_sales) |>
        arrange(n_sales)
    )
  }

  # 8. Final summary
  message("")
  message("=== Final CD Sales Panel ===")
  message("CD sales panel built with ", nrow(cd_panel), " rows.")
  message("Distinct CDs: ", n_distinct(cd_panel$cd_id))
  message("Years: ", paste(sort(unique(cd_panel$year)), collapse = ", "))
  message("Periods: ", paste(sort(unique(cd_panel$period)), collapse = ", "))

  cd_panel
}
```
```{r}
#| label: matching-diagnostics
#| echo: false
#| message: false
#| warning: false
# Run the function (it creates cd_sales_panel and assigns
# sales_match_summary to the global environment)
cd_sales_panel <- build_cd_sales_panel(
  pre_years  = PERIOD_DEFINITIONS$pre_covid$sales_years,
  post_years = PERIOD_DEFINITIONS$post_covid$sales_years
)
```
```
=== Building CD sales panel ===
Pre-COVID years: 2017, 2018, 2019
Post-COVID years: 2021, 2022, 2023
 [1/30] Processing year 2017 – bronx
 ✓ Using cached: 2017 bronx
   n_total = 5,165, n_matched = 4,636 (89.8%)
 [2/30] Processing year 2017 – brooklyn
 ✓ Using cached: 2017 brooklyn
   n_total = 14,822, n_matched = 10,417 (70.3%)
 [3/30] Processing year 2017 – manhattan
 ✓ Using cached: 2017 manhattan
   n_total = 14,058, n_matched = 6,927 (49.3%)
 [4/30] Processing year 2017 – queens
 ✓ Using cached: 2017 queens
   n_total = 18,178, n_matched = 15,443 (85.0%)
 [5/30] Processing year 2017 – staten_island
 ✓ Using cached: 2017 staten_island
   n_total = 6,282, n_matched = 5,623 (89.5%)
 [6/30] Processing year 2018 – bronx
 ✓ Using cached: 2018 bronx
   n_total = 5,301, n_matched = 4,661 (87.9%)
 [7/30] Processing year 2018 – brooklyn
 ✓ Using cached: 2018 brooklyn
   n_total = 13,656, n_matched = 9,755 (71.4%)
 [8/30] Processing year 2018 – manhattan
 ✓ Using cached: 2018 manhattan
   n_total = 12,365, n_matched = 6,485 (52.4%)
 [9/30] Processing year 2018 – queens
 ✓ Using cached: 2018 queens
   n_total = 17,148, n_matched = 14,949 (87.2%)
 [10/30] Processing year 2018 – staten_island
 ✓ Using cached: 2018 staten_island
   n_total = 6,007, n_matched = 5,396 (89.8%)
 [11/30] Processing year 2019 – bronx
 ✓ Using cached: 2019 bronx
   n_total = 4,913, n_matched = 4,336 (88.3%)
 [12/30] Processing year 2019 – brooklyn
 ✓ Using cached: 2019 brooklyn
   n_total = 13,111, n_matched = 9,296 (70.9%)
 [13/30] Processing year 2019 – manhattan
 ✓ Using cached: 2019 manhattan
   n_total = 12,898, n_matched = 6,497 (50.4%)
 [14/30] Processing year 2019 – queens
 ✓ Using cached: 2019 queens
   n_total = 16,244, n_matched = 14,321 (88.2%)
 [15/30] Processing year 2019 – staten_island
 ✓ Using cached: 2019 staten_island
   n_total = 5,308, n_matched = 4,795 (90.3%)
 [16/30] Processing year 2021 – bronx
 ✓ Using cached: 2021 bronx
   n_total = 4,877, n_matched = 4,399 (90.2%)
 [17/30] Processing year 2021 – brooklyn
 ✓ Using cached: 2021 brooklyn
   n_total = 17,649, n_matched = 11,181 (63.4%)
 [18/30] Processing year 2021 – manhattan
 ✓ Using cached: 2021 manhattan
   n_total = 18,417, n_matched = 8,989 (48.8%)
 [19/30] Processing year 2021 – queens
 ✓ Using cached: 2021 queens
   n_total = 18,630, n_matched = 15,438 (82.9%)
 [20/30] Processing year 2021 – staten_island
 ✓ Using cached: 2021 staten_island
   n_total = 6,923, n_matched = 6,199 (89.5%)
 [21/30] Processing year 2022 – bronx
 ✓ Using cached: 2022 bronx
   n_total = 4,703, n_matched = 4,252 (90.4%)
 [22/30] Processing year 2022 – brooklyn
 ✓ Using cached: 2022 brooklyn
   n_total = 16,267, n_matched = 10,789 (66.3%)
 [23/30] Processing year 2022 – manhattan
 ✓ Using cached: 2022 manhattan
   n_total = 17,017, n_matched = 8,505 (50.0%)
 [24/30] Processing year 2022 – queens
 ✓ Using cached: 2022 queens
   n_total = 17,957, n_matched = 14,777 (82.3%)
 [25/30] Processing year 2022 – staten_island
 ✓ Using cached: 2022 staten_island
   n_total = 5,795, n_matched = 5,219 (90.1%)
 [26/30] Processing year 2023 – bronx
 ✓ Using cached: 2023 bronx
   n_total = 3,566, n_matched = 3,165 (88.8%)
 [27/30] Processing year 2023 – brooklyn
 ✓ Using cached: 2023 brooklyn
   n_total = 11,486, n_matched = 8,051 (70.1%)
 [28/30] Processing year 2023 – manhattan
 ✓ Using cached: 2023 manhattan
   n_total = 12,779, n_matched = 6,412 (50.2%)
 [29/30] Processing year 2023 – queens
 ✓ Using cached: 2023 queens
   n_total = 13,978, n_matched = 11,561 (82.7%)
 [30/30] Processing year 2023 – staten_island
 ✓ Using cached: 2023 staten_island
   n_total = 4,326, n_matched = 3,918 (90.6%)

=== Per-File Match Summary ===

=== Attempting Block-Level Matching ===
Unmatched sales: 93,434
Block-matched: 93,434
Still unmatched: 0

=== Overall Match Statistics ===
Total sales (all files): 339,826
Exact BBL matches: 246,392
Block-level matches: 93,434
Sales matched to CDs: 339,826
Sales NOT matched: 0
Overall match rate: 100%

=== Final CD Sales Panel ===
CD sales panel built with 354 rows.
Distinct CDs: 59
Years: 2017, 2018, 2019, 2021, 2022, 2023
Periods: post_covid, pre_covid
```
```{r}
# Calculate Stage 1: Exact matches (from sales_match_summary)
stage1_total   <- sum(sales_match_summary$n_total, na.rm = TRUE)
stage1_matched <- sum(sales_match_summary$n_matched, na.rm = TRUE)

# Calculate Stage 2: Total sales that made it into the panel
total_in_panel <- sum(cd_sales_panel$n_sales, na.rm = TRUE)

# Stage 2 additions = sales in panel that weren't exact matches
stage2_additions <- total_in_panel - stage1_matched

# Calculate match rate percentage
match_rate_pct <- round(100 * total_in_panel / stage1_total, 1)

# Create summary table
tibble(
  Stage = c(
    "Total sales (all files)",
    "Stage 1: Exact BBL matches",
    "Stage 2: Block-level matches",
    "Total matched to CDs",
    "Overall match rate"
  ),
  Count = c(
    comma(stage1_total),
    comma(stage1_matched),
    comma(stage2_additions),
    comma(total_in_panel),
    paste0(match_rate_pct, "%")
  ),
  Percentage = c(
    "100%",
    paste0(round(100 * stage1_matched / stage1_total, 1), "%"),
    paste0(round(100 * stage2_additions / stage1_total, 1), "%"),
    paste0(round(100 * total_in_panel / stage1_total, 1), "%"),
    ""
  )
) |>
  kable(
    caption = "**Table 2: Enhanced BBL Matching Results: Two-Stage Approach**",
    align = c("l", "l", "l")
  )
```
Note: The code block below handles all data acquisition and pre-processing to establish the foundational datasets for this research. These background processes include structural validations to ensure data integrity before the analysis proper.
Sourced from Table B15003, the American Community Survey (ACS) education variables are reported for approximately 2,200 census tracts whose boundaries do not align perfectly with NYC CDs. This analysis therefore uses area-weighted aggregation, a geographic method that apportions tract data across overlapping CDs so that 100% of the population is accounted for.
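The apportionment arithmetic is easy to see in a stylized example (the CD IDs and overlap shares below are invented; in practice the weights come from intersecting tract and CD polygons, e.g., with `sf::st_intersection()` or `sf::st_interpolate_aw()`):

```r
# One census tract (pop 25+ = 1000, of whom 400 hold a BA+) overlaps two CDs.
# Suppose 60% of the tract's area falls in "BK01" and 40% in "BK02".
tract_pop <- 1000
tract_ba  <- 400
weights   <- c(BK01 = 0.60, BK02 = 0.40)

# Area weighting apportions both numerator and denominator by overlap share,
# so the tract's population is fully allocated across the two CDs
pop_share <- tract_pop * weights
ba_share  <- tract_ba * weights

isTRUE(all.equal(sum(pop_share), tract_pop))  # TRUE: 100% of the population allocated
```

Summing these shares over all tracts touching a CD, then dividing BA+ counts by population counts, yields the CD-level attainment rates used below.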
```{r}
#| label: cd-edu-summary
#| echo: false
#| message: false
#| warning: false
# 1. Guess which column in cd_education_panel is BA+ (in % or proportion)
ba_candidates <- names(cd_education_panel)[
  grepl("ba", names(cd_education_panel), ignore.case = TRUE) |
    grepl("bach", names(cd_education_panel), ignore.case = TRUE)
]
if (length(ba_candidates) == 0) {
  stop(
    "Could not find a BA+ column in cd_education_panel.\n",
    "Please set ba_col manually to the correct column name."
  )
}
ba_col <- ba_candidates[1]
# If you know the correct name, you can override:
# ba_col <- "YOUR_COLUMN_NAME_HERE"

# 2. Standardize period labels to a simple Pre/Post flag
cd_edu_tmp <- cd_education_panel |>
  mutate(
    period2 = case_when(
      period %in% c("pre_covid", "Pre-COVID") ~ "Pre",
      period %in% c("post_covid", "Post-COVID") ~ "Post",
      TRUE ~ NA_character_
    )
  ) |>
  filter(period2 == "Pre")

# 3. Collapse to one row per CD with baseline BA+ value
cd_edu_summary <- cd_edu_tmp |>
  group_by(cd_id) |>
  summarise(
    ba_raw = first(.data[[ba_col]]),
    .groups = "drop"
  )

# 4. Convert to percent if necessary (0–1 → 0–100)
max_ba <- max(cd_edu_summary$ba_raw, na.rm = TRUE)
cd_edu_summary <- cd_edu_summary |>
  mutate(pct_ba = if (max_ba <= 1.5) ba_raw * 100 else ba_raw)

# 5. Build terciles (Low / Medium / High)
cd_edu_summary <- cd_edu_summary |>
  mutate(
    edu_tercile_num = ntile(pct_ba, 3),
    edu_tercile = case_when(
      edu_tercile_num == 1 ~ "Low",
      edu_tercile_num == 2 ~ "Medium",
      TRUE ~ "High"
    ),
    edu_tercile = factor(edu_tercile, levels = c("Low", "Medium", "High"))
  ) |>
  select(cd_id, pct_ba, edu_tercile)
```
Treating EA as a fixed baseline is a critical research control. This step is necessary to isolate pure market demand from potential confounding variables. Updating education data post-COVID would obfuscate the source of price changes (e.g., preference changes or population shifts). Consequently, holding education constant at 2019 levels ensures that results yield an accurate measure of changing housing demand.
Table 3 below demonstrates this fixed baseline strategy, showing that the same 2019 ACS education data is consistently applied across all six research years (2017–2023, excluding 2020), enabling clean isolation of pandemic-era market shifts.
```{r}
#| label: integrated-panel-coverage
#| echo: false
#| message: false
#| warning: false
# Create integrated table showing sales years matched to fixed ACS baseline
integrated_coverage <- cd_sales_panel |>
  group_by(year) |>
  summarise(
    n_cds = n_distinct(cd_id),
    .groups = "drop"
  ) |>
  # Add period labels
  mutate(
    period = case_when(
      year %in% c(2017, 2018, 2019) ~ "Pre-COVID",
      year %in% c(2021, 2022, 2023) ~ "Post-COVID",
      TRUE ~ NA_character_
    )
  ) |>
  # Add ACS baseline year (constant 2019)
  mutate(acs_baseline = 2019) |>
  # Reorder columns
  select(
    Year = year,
    Period = period,
    `ACS Baseline` = acs_baseline,
    CDs = n_cds
  ) |>
  arrange(Year)

kable(
  integrated_coverage,
  caption = "**Table 3: Sales Years Matched to Fixed ACS Baseline (2019)**",
  align = c("l", "l", "l", "l")
)
```
**Table 3: Sales Years Matched to Fixed ACS Baseline (2019)**

| Year | Period     | ACS Baseline | CDs |
|------|------------|--------------|-----|
| 2017 | Pre-COVID  | 2019         | 59  |
| 2018 | Pre-COVID  | 2019         | 59  |
| 2019 | Pre-COVID  | 2019         | 59  |
| 2021 | Post-COVID | 2019         | 59  |
| 2022 | Post-COVID | 2019         | 59  |
| 2023 | Post-COVID | 2019         | 59  |
```{r}
# Structural validation checks (silent)
stopifnot(
  # Sales years per period
  identical(
    sort(unique(cd_sales_panel$year[cd_sales_panel$period == "pre_covid"])),
    PERIOD_DEFINITIONS$pre_covid$sales_years
  ),
  identical(
    sort(unique(cd_sales_panel$year[cd_sales_panel$period == "post_covid"])),
    PERIOD_DEFINITIONS$post_covid$sales_years
  ),
  # ACS end-years per period
  identical(
    sort(unique(cd_education_panel$acs_year[cd_education_panel$period == "pre_covid"])),
    PERIOD_DEFINITIONS$pre_covid$acs_end_year
  ),
  identical(
    sort(unique(cd_education_panel$acs_year[cd_education_panel$period == "post_covid"])),
    PERIOD_DEFINITIONS$post_covid$acs_end_year
  )
)
```
### Temporal Scope and Final Integration
The core of this analysis compares two three-year periods: pre-COVID (2017-2019) and post-COVID (2021-2023).
```{r}
#| label: temporal-coverage-summary
#| echo: false
temporal_scope <- tibble(
  Component = c(
    "Pre-COVID sales period",
    "Post-COVID sales period",
    "Excluded year",
    "Education data (baseline)"
  ),
  Detail = c(
    "2017, 2018, 2019 (3 years)",
    "2021, 2022, 2023 (3 years)",
    "2020 (pandemic disruption)",
    "ACS 2015–2019 (5-year), used as baseline for both periods"
  )
)

kable(
  temporal_scope,
  caption = "**Table 4: Temporal Scope of Analysis**",
  col.names = c("Component", "Detail"),
  align = c("l", "l")
)
```
**Table 4: Temporal Scope of Analysis**

| Component                 | Detail                                                     |
|---------------------------|------------------------------------------------------------|
| Pre-COVID sales period    | 2017, 2018, 2019 (3 years)                                 |
| Post-COVID sales period   | 2021, 2022, 2023 (3 years)                                 |
| Excluded year             | 2020 (pandemic disruption)                                 |
| Education data (baseline) | ACS 2015–2019 (5-year), used as baseline for both periods  |
Omitting 2020 is essential to the integrity of the analysis. The year's acute shocks would distort long-term trend estimates, so excluding it yields a clearer view of the market's post-COVID response.
The final data integration, shown in Table 5, merges the CD-year sales measures (e.g., median price by CD-year) with education baselines held constant at 2019 levels.
```{r}
# Validation assertions (silent)
stopifnot(
  "Each CD–period should have 3 years" =
    all(edu_checks$n_years == 3L),
  "Each CD–period should have constant pct_ba_plus" =
    all(edu_checks$n_edu_values == 1L),
  "Each CD–period should have constant acs_year" =
    all(edu_checks$n_acs_years == 1L)
)
```
The approaches highlighted in the Data Acquisition and Processing section create a balanced panel of 354 CD-year observations: 59 CDs spanning six years of data. This structure facilitates the research's Difference-in-Differences (DiD) analysis, ensuring that any identified trends are attributable to post-pandemic shifts.
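The balanced-panel claim is cheap to verify in code. A sketch using a mock complete grid (the `CD%02d` IDs below are placeholders, not real CD codes; in the real panel the same checks run against `cd_sales_panel`):

```r
# Mock panel with every CD-year combination present (59 CDs x 6 years)
mock_panel <- expand.grid(
  cd_id = sprintf("CD%02d", 1:59),
  year  = c(2017:2019, 2021:2023)
)

nrow(mock_panel)                   # 354 rows, matching the text
all(table(mock_panel$cd_id) == 6)  # TRUE: every CD appears in every year
```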
## Pre-COVID Analytical Framework
### Creating the Analysis Set
This baseline analysis addresses the OQ by establishing EA as a strong neighborhood predictor pre-pandemic. Quantifying this “education premium” establishes the context for the study to determine whether the pandemic weakened or strengthened the link between EA and property values.
To analyze how CDs responded to the pandemic, this research applies a non-parametric tercile grouping, stratifying CDs into "Low," "Medium," and "High" EA tiers to mitigate outlier effects when comparing distributions.
Table 6 shows a clear delineation: Low-education CDs average 19% BA+ attainment, while High-education CDs average 53%. This variation establishes a tangible baseline for comparing how CD housing markets evolved during the pandemic era.
```{r}
#| label: create-analysis-dataset
#| code-fold: true
#| code-summary: "Show code: Create analysis dataset with terciles and price changes"
#| output: false
# STEP 1: Collapse 354 CD-year observations to 59 CD-level averages.
# This averaging reduces year-to-year noise and creates structure for DiD analysis.
cd_analysis_simple <- cd_sales_panel |>
  mutate(
    # Simplified period labels for pivot_wider operation
    period_group = case_when(
      period == "pre_covid" ~ "pre",
      period == "post_covid" ~ "post",
      TRUE ~ NA_character_
    )
  ) |>
  # Calculate average prices and total sales by CD and period
  group_by(cd_id, period_group) |>
  summarise(
    avg_median_price = mean(median_price, na.rm = TRUE),
    avg_mean_price = mean(mean_price, na.rm = TRUE),
    total_sales = sum(n_sales, na.rm = TRUE),
    .groups = "drop"
  ) |>
  # Pivot to wide format: one row per CD with pre/post columns
  pivot_wider(
    names_from = period_group,
    values_from = c(avg_median_price, avg_mean_price, total_sales),
    names_glue = "{.value}_{period_group}"
  ) |>
  # STEP 2: Calculate price changes in both dollar and log terms
  mutate(
    price_change_dollars = avg_median_price_post - avg_median_price_pre,
    price_change_pct = (price_change_dollars / avg_median_price_pre) * 100,
    # Log transformation for percentage-based interpretation
    log_price_change = log(avg_median_price_post) - log(avg_median_price_pre),
    log_price_change_pct = log_price_change * 100
  )

# STEP 3: Add baseline education data (2019 ACS, held constant)
cd_analysis_simple <- cd_analysis_simple |>
  left_join(
    cd_education_panel |>
      filter(period == "pre_covid") |>
      select(
        cd_id,
        pct_ba_plus_2019 = pct_ba_plus,
        total_pop_25plus_2019 = total_pop_25plus,
        ba_plus_2019 = ba_plus
      ),
    by = "cd_id"
  )

# STEP 4: Create education terciles (divide 59 CDs into 3 equal groups).
# Calculate tercile breaks at the 33rd and 67th percentiles.
education_tercile_breaks <- quantile(
  cd_analysis_simple$pct_ba_plus_2019,
  probs = c(0, 1/3, 2/3, 1),
  na.rm = TRUE
)

cd_analysis_simple <- cd_analysis_simple |>
  mutate(
    # Assign each CD to Low/Medium/High based on tercile breaks
    edu_tercile = cut(
      pct_ba_plus_2019,
      breaks = education_tercile_breaks,
      labels = c("Low", "Medium", "High"),
      include.lowest = TRUE
    ),
    # Extract borough from CD ID (first 2 characters)
    borough = case_when(
      substr(cd_id, 1, 2) == "MN" ~ "Manhattan",
      substr(cd_id, 1, 2) == "BK" ~ "Brooklyn",
      substr(cd_id, 1, 2) == "BX" ~ "Bronx",
      substr(cd_id, 1, 2) == "QN" ~ "Queens",
      substr(cd_id, 1, 2) == "SI" ~ "Staten Island",
      TRUE ~ NA_character_
    )
  )

# STEP 5: Rename columns for clarity in subsequent analysis
cd_analysis_simple <- cd_analysis_simple |>
  rename(
    price_pre = avg_median_price_pre,
    price_post = avg_median_price_post,
    n_sales_pre = total_sales_pre,
    n_sales_post = total_sales_post
  )
```
This tercile structure enables parallel-trends testing.
Note: The 40.1 percentage point (pp) gap between high and low tercile means (59.5% - 19.4%) will be part of later internal consistency checks in the regression analysis.
### Pre-Trend Diagnostics
A DiD design filters out confounding factors (e.g., broad economic shifts and fixed neighborhood characteristics), allowing a strict focus on the pandemic's effect. This research first evaluates the Parallel-Trends Assumption (PTA) so that a DiD estimate can be interpreted as a genuine structural shift in market behavior rather than a continuation of pre-existing trends.
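Under the PTA, the DiD comparison reduces to an interaction term in a regression. A minimal sketch on simulated data (the variable names, group split, and effect sizes below are illustrative, not the project's actual model):

```r
set.seed(9750)

# Simulated CD-period panel: 40 CDs observed pre (post = 0) and post (post = 1)
toy <- expand.grid(cd = 1:40, post = c(0, 1))
toy$high_edu <- as.integer(toy$cd > 20)

# Data-generating process: a 0.5 log-point education premium, 5% general
# appreciation, and an extra 10% post-COVID gain for high-education CDs
toy$log_price <- 13 + 0.5 * toy$high_edu + 0.05 * toy$post +
  0.10 * toy$high_edu * toy$post + rnorm(nrow(toy), sd = 0.02)

# The high_edu:post interaction coefficient is the DiD estimate
# of the differential post-COVID shift (should recover roughly 0.10)
fit <- lm(log_price ~ high_edu * post, data = toy)
coef(fit)[["high_edu:post"]]
```

The interaction coefficient only has this causal reading if the two groups would have trended in parallel absent the pandemic, which is exactly what the diagnostics below probe.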
#### Logarithmic Transformation
To make this comparison meaningful, the analysis applies a logarithmic transformation to property values. Because high-EA CDs start from much higher baselines, raw-dollar comparisons would blur trend differences. The transformation yields relative appreciation rates that are comparable across terciles; for example, a 0.10 log-point shift corresponds to roughly a 10% change in value.
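The log-point approximation is easy to verify: a change of Δ in log price corresponds to an exp(Δ) − 1 percentage change, which is close to Δ when Δ is small (the $500,000 starting price below is an arbitrary worked example):

```r
delta <- 0.10                       # 0.10 log points
exact_pct <- (exp(delta) - 1) * 100 # exact percentage change: about 10.5%

# Worked example: a CD moving from $500,000 to exp(0.10) * $500,000
p0 <- 500000
p1 <- p0 * exp(delta)
log(p1) - log(p0)                   # recovers the 0.10 log-point shift
```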
Figure 1 shows how High-EA CDs start at roughly twice the price level of low-education districts. However, the three trajectories rise at similar rates, reinforcing the PTA.
```{r}
#| label: fig-pretrend-diagnostic
#| echo: false
#| message: false
#| warning: false
#| fig-width: 8
#| fig-height: 5
# STEP 1: Prepare pre-trend data for visualization.
# Filter to pre-COVID years only and join tercile assignments.
pretrend_data <- cd_sales_panel |>
  filter(period == "pre_covid") |>
  left_join(
    cd_analysis_simple |> select(cd_id, edu_tercile),
    by = "cd_id"
  ) |>
  # STEP 2: Calculate average log price by tercile and year.
  # Log transformation normalizes price differences across terciles.
  group_by(edu_tercile, year) |>
  summarise(
    avg_log_price = mean(log(median_price), na.rm = TRUE),
    n_cds = n(),
    .groups = "drop"
  ) |>
  filter(!is.na(edu_tercile))

# STEP 3: Create parallel trends visualization
ggplot(
  pretrend_data,
  aes(x = year, y = avg_log_price, color = edu_tercile, group = edu_tercile)
) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +
  scale_color_manual(
    values = c("Low" = "#d95f02", "Medium" = "#7570b3", "High" = "#1b9e77"),
    name = "Education Level"
  ) +
  scale_x_continuous(breaks = c(2017, 2018, 2019)) +
  labs(
    title = "Figure 1: Pre-COVID Price Trends by Education Level (2017–2019)",
    subtitle = "Average log median sale price by education tercile",
    x = "Year",
    y = "Log Median Sale Price"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    legend.position = "bottom",
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 11),
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank()
  )
```
Table 7 below summarizes this pre-COVID growth consistency, with annual growth rates ranging from 3.0% to 8.2%.
Table 7: Pre-COVID Annual Price Growth by Education Group

| Education Group | Log-Point Slope | Annual % Growth |
|-----------------|----------------:|----------------:|
| Low             | 0.079           | 8.2             |
| Medium          | 0.030           | 3.0             |
| High            | 0.043           | 4.4             |
```{r}
# Calculate maximum slope difference for parallel trends test
# Small difference (<0.05) supports parallel trends assumption
max_diff <- max(pretrend_slopes$slope) - min(pretrend_slopes$slope)
```
The small maximum inter-group slope difference of 0.049 satisfies the PTA threshold (< 0.05), indicating that post-pandemic changes represent structural shifts rather than continuations of pre-existing trajectories.
Education and Value: A Lesson on Geography
NYC’s exceptionally diverse EA landscape requires establishing baseline disparities before testing pandemic impacts, as shown in Table 8 below.
EA varies widely across CDs (SD ≈ 19.8 pp). The 18.9 pp gap between the 25th and 75th percentiles supports tercile grouping, which better accounts for extreme shifts in CD behavior.
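The tercile grouping can be sketched as follows. This toy example assumes only a `pct_ba_plus_2019` column and uses `dplyr::ntile()`; it may differ in detail from how the report's `cd_analysis_simple` terciles were actually constructed:

```r
library(dplyr)

# Toy stand-in for the CD-level data (values are hypothetical)
toy_cds <- data.frame(
  cd_id = paste0("CD", 1:9),
  pct_ba_plus_2019 = c(12, 18, 25, 33, 41, 48, 56, 70, 82)
)

toy_cds <- toy_cds |>
  mutate(
    # ntile() splits CDs into three equal-sized groups by attainment
    edu_tercile = factor(
      ntile(pct_ba_plus_2019, 3),
      levels = 1:3,
      labels = c("Low", "Medium", "High")
    )
  )

table(toy_cds$edu_tercile)  # three CDs per tercile
```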
Table 4 below further highlights these disparities, with BA+ attainment ranging from 11.7% (BX01, in the South Bronx) to 82.5% (MN05, Midtown Manhattan), a gap of roughly 71 pp.
Table 4. Community Districts with Highest and Lowest Educational Attainment (2019)

| Rank     | CD ID | Borough   | BA+ Attainment |
|----------|-------|-----------|---------------:|
| Top 3    | MN05  | Manhattan | 82.5%          |
| Top 3    | MN01  | Manhattan | 81.5%          |
| Top 3    | MN08  | Manhattan | 81.1%          |
| Bottom 3 | BX01  | Bronx     | 11.7%          |
| Bottom 3 | BX05  | Bronx     | 12.2%          |
| Bottom 3 | BX06  | Bronx     | 12.9%          |
The interactive Leaflet map below highlights EA disparities across CDs; clicking an individual CD reveals its BA+ attainment percentage.
```{r}
#| label: education-map
#| echo: false
#| fig-width: 8
#| fig-height: 6
#| fig-cap: "**Figure 2. Educational Attainment by Community District (2019).** Percent of adults age 25+ with a Bachelor's degree or higher. Data from ACS 2015-2019 (5-year), area-weighted to Community District boundaries."

library(leaflet)
library(sf)

# Join education to CD polygons
cd_shp_edu <- nyc_cd |>
  left_join(
    cd_analysis_simple |> select(cd_id, pct_ba_plus_2019),
    by = "cd_id"
  ) |>
  mutate(
    # Pre-format display values
    ba_display = paste0(round(pct_ba_plus_2019, 1), "%"),
    # Create popup HTML
    popup_html = paste0(
      "<strong style='font-size: 14px;'>", cd_id, "</strong><br/>",
      "<strong>BA+ Attainment:</strong> ", ba_display
    )
  )

# Transform to WGS84
cd_shp_edu_wgs84 <- st_transform(cd_shp_edu, 4326)

# Create color palette (viridis to match original)
pal <- colorNumeric(
  palette = "viridis",
  domain = cd_shp_edu_wgs84$pct_ba_plus_2019
)

# Create leaflet map
leaflet(cd_shp_edu_wgs84, width = "100%", height = "500px") |>
  addProviderTiles(providers$CartoDB.Positron) |>
  addPolygons(
    fillColor = ~pal(pct_ba_plus_2019),
    fillOpacity = 0.7,
    color = "white",
    weight = 2,
    opacity = 1,
    popup = ~popup_html,
    highlightOptions = highlightOptions(
      weight = 3,
      color = "#666",
      fillOpacity = 0.9,
      bringToFront = TRUE
    ),
    label = ~cd_id,
    labelOptions = labelOptions(
      style = list("font-weight" = "normal", padding = "3px 8px"),
      textsize = "12px",
      direction = "auto"
    )
  ) |>
  addLegend(
    pal = pal,
    values = ~pct_ba_plus_2019,
    title = "<strong>BA+ Attainment</strong>",
    position = "bottomright",
    opacity = 0.9,
    labFormat = labelFormat(suffix = "%", transform = function(x) round(x, 0))
  )
```
The Education Premium
Pre-COVID, there was a strong linear correlation (r = 0.77) between a neighborhood's EA and its property values.
```{r}
#| label: baseline-scatter
#| echo: false
#| fig-width: 8
#| fig-height: 5
#| fig-cap: "**Figure 3. Pre-COVID Baseline: Education and Property Values.** Higher-education Community Districts had substantially higher median sale prices before the pandemic (r = 0.77, p < 0.001)."

# Calculate correlation for caption
baseline_cor <- cor(
  cd_analysis_simple$pct_ba_plus_2019,
  cd_analysis_simple$price_pre,
  use = "complete.obs"
)

ggplot(
  cd_analysis_simple,
  aes(x = pct_ba_plus_2019, y = price_pre)
) +
  geom_point(aes(color = borough), size = 3, alpha = 0.7) +
  geom_smooth(method = "lm", se = TRUE, color = "black", linewidth = 1) +
  scale_y_continuous(
    labels = dollar_format(scale = 1e-3, suffix = "K"),
    name = "Average Median Sale Price (2017–2019)"
  ) +
  scale_x_continuous(
    labels = percent_format(scale = 1),
    name = "Adults with BA+ (2019)"
  ) +
  scale_color_brewer(palette = "Set2", name = "Borough") +
  labs(
    title = "Pre-COVID Baseline: Education and Property Values",
    subtitle = paste0("Correlation: r = ", round(baseline_cor, 2))
  ) +
  theme_minimal(base_size = 13) +
  theme(
    legend.position = "bottom",
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 11)
  )
```
The simple linear regression below models this premium.
\[\text{Median Price}_{\text{pre}} = \beta_0 + \beta_1 \times \text{BA+\%}_{2019} + \epsilon\]

Table 5a presents the full regression results, showing that each additional pp of BA+ Attainment predicts a $13,842 higher median sale price, which is statistically significant (α = 0.05, p < 0.001).
```{r}
#| label: baseline-regression
#| echo: false

# STEP 1: Estimate baseline regression (Pre-COVID)
# Model: Median Price = β₀ + β₁(BA+ %) + ε
baseline_model <- lm(price_pre ~ pct_ba_plus_2019, data = cd_analysis_simple)

# Extract coefficients for use in narrative text
baseline_slope <- coef(baseline_model)[2]
baseline_intercept <- coef(baseline_model)[1]
slope_thousands <- round(baseline_slope / 1000, 1)  # Convert to thousands for display

# STEP 2: Create clean regression table using broom package
library(broom)

regression_table <- tidy(baseline_model, conf.int = TRUE) |>
  mutate(
    # Replace generic term names with descriptive labels
    term = c("Intercept", "BA+ Attainment (%)"),
    # Format numeric columns as currency/thousands
    across(c(estimate, std.error, conf.low, conf.high), ~comma(round(.x, 0))),
    # Round test statistics
    statistic = round(statistic, 2),
    # Format p-values (show "< 0.001" for very small values)
    p.value = ifelse(p.value < 0.001, "< 0.001", round(p.value, 3))
  ) |>
  select(
    Term = term,
    Coefficient = estimate,
    `Std. Error` = std.error,
    `95% CI Lower` = conf.low,
    `95% CI Upper` = conf.high,
    `t-statistic` = statistic,
    `p-value` = p.value
  )

# Display regression coefficients table
kable(
  regression_table,
  caption = "**Table 5a: Pre-COVID Baseline Regression: Median Sale Price on Educational Attainment**",
  align = c("l", rep("r", 6))
)
```
Table 5a: Pre-COVID Baseline Regression: Median Sale Price on Educational Attainment

| Term               | Coefficient | Std. Error | 95% CI Lower | 95% CI Upper | t-statistic | p-value |
|--------------------|------------:|-----------:|-------------:|-------------:|------------:|--------:|
| Intercept          | 234,656     | 63,274     | 107,952      | 361,361      | 3.71        | < 0.001 |
| BA+ Attainment (%) | 13,842      | 1,486      | 10,866       | 16,818       | 9.31        | < 0.001 |
Consequently, the fitted regression equation is:

\[\widehat{\text{Median Price}}_{\text{pre}} = 234{,}656 + 13{,}842 \times \text{BA+\%}_{2019}\]
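Applying the Table 5a estimates by hand illustrates the size of the pre-COVID premium. This is a worked example using the reported coefficients, not output from the report's pipeline; the attainment levels chosen (12% and 82%) roughly bracket the observed CD range:

```r
# Fitted values from Table 5a
b0 <- 234656  # intercept (dollars)
b1 <- 13842   # dollars per percentage point of BA+ attainment

predict_price <- function(ba_pct) b0 + b1 * ba_pct

predict_price(12)                      # low-EA CD: $400,760
predict_price(82)                      # high-EA CD: $1,369,700
predict_price(82) - predict_price(12)  # 70 pp gap: $968,940
```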
Moreover, EA explains 60.3% of the variation in property values across CDs (\(R^2 = 0.603\)), as shown in Table 5b below.
```{r}
#| label: baseline-model-fit-statistics
#| echo: false

# Extract and display model fit statistics
# R² shows proportion of variance explained by education
model_fit <- glance(baseline_model) |>
  transmute(
    `R²` = round(r.squared, 3),
    `Adjusted R²` = round(adj.r.squared, 3),
    `Residual SE` = comma(round(sigma, 0)),
    `F-statistic` = round(statistic, 2),
    `p-value` = "< 0.001"
  )

kable(
  model_fit,
  caption = "**Table 5b: Model Fit Statistics**",
  align = rep("r", 5)
)
```
Table 5b: Model Fit Statistics

| R²    | Adjusted R² | Residual SE | F-statistic | p-value |
|------:|------------:|------------:|------------:|--------:|
| 0.603 | 0.596       | 225,270     | 86.74       | < 0.001 |
Although EA was a dominant neighborhood-level predictor pre-COVID, the analysis below demonstrates a statistically significant disruption to this pattern, exposing a sharp post-pandemic reversal of the education premium.
Post-COVID Analysis and Results
The Education Reversal
The post-COVID scatter plot below reveals a weaker but still positive relationship between EA and property values (r = 0.74, down from r = 0.77 pre-COVID).
```{r}
#| label: postcovid-scatter
#| echo: false
#| fig-width: 8
#| fig-height: 5
#| fig-cap: "**Figure X. Post-COVID: Education and Property Values.** The positive relationship between neighborhood education and property values remains, but it is weaker than in the pre-pandemic period."

# Correlation between baseline education and post-COVID prices
post_cor <- cor(
  cd_analysis_simple$pct_ba_plus_2019,
  cd_analysis_simple$price_post,
  use = "complete.obs"
)

ggplot(
  cd_analysis_simple,
  aes(x = pct_ba_plus_2019, y = price_post)
) +
  geom_point(aes(color = borough), size = 3, alpha = 0.7) +
  geom_smooth(method = "lm", se = TRUE, color = "black", linewidth = 1) +
  scale_y_continuous(
    labels = dollar_format(scale = 1e-3, suffix = "K"),
    name = "Average Median Sale Price (2021–2023)"
  ) +
  scale_x_continuous(
    labels = percent_format(scale = 1),
    name = "Adults with BA+ (2019)"
  ) +
  scale_color_brewer(palette = "Set2", name = "Borough") +
  labs(
    title = "Post-COVID: Education and Property Values",
    subtitle = paste0("Correlation: r = ", round(post_cor, 2))
  ) +
  theme_minimal(base_size = 13) +
  theme(
    legend.position = "bottom",
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 11)
  )
```
While high EA neighborhoods maintain an absolute price advantage, the correlation’s decline suggests the education premium diminished during the pandemic years, leading to the rejection of Hypothesis 1.
The regression line’s flatter slope indicates that each additional pp of BA+ Attainment predicts a smaller price differential than in the pre-COVID period. Table 5 provides additional clarity with actual price changes across terciles, revealing which groups appreciated fastest.
Table 5. Pre- and Post-COVID Median Prices and Changes by Education Group

| Education Group | CDs | Pre-COVID Median | Post-COVID Median | Change ($) | Change (%) |
|-----------------|----:|-----------------:|------------------:|-----------:|-----------:|
| Low             | 20  | $517,091         | $641,946          | $124,854   | 26.03%     |
| Medium          | 19  | $720,373         | $813,311          | $92,938    | 15.03%     |
| High            | 20  | $1,031,182       | $1,127,420        | $96,237    | 11.86%     |
| All CDs         | 59  | $756,823         | $861,699          | $104,875   | 17.68%     |
These findings indicate a remarkable reversal of the traditional education premium. From 2017–19 to 2021–23, Low-education CDs experienced 26.03% median price growth, whereas Medium- and High-education CDs grew by only 15.03% and 11.86%, respectively. Most notably, High-education CDs grew at less than half the rate of their Low-education counterparts, a gap of 14.2 pp.
This result rejects Hypothesis 2, as Low-education CDs saw the fastest appreciation. The Leaflet map below highlights this appreciation pattern across all CDs.
To contextualize this reversal: a median-priced home in a low-education CD (e.g., BX07) gained approximately $252,500 in value, compared to roughly $5,000 in a high-education CD (e.g., MN07), a $247,500 difference associated with the EA composition of the neighborhood.
The t-test in Table 6 below confirms this pattern.
Table 6. Difference in Average Price Growth Between High- and Low-Education CDs

| Comparison | Difference | 95% CI        | t-statistic | df | p-value |
|------------|-----------:|--------------:|------------:|---:|--------:|
| High − Low | -14.2 pp   | [-22.9, -5.4] | -3.27       | 37 | 0.002   |
The difference between the High and Low terciles is statistically significant (p = 0.002), with a 95% confidence interval that entirely excludes zero, indicating this pattern is unlikely to have occurred by chance.
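For reference, a High-versus-Low comparison of this kind can be run with R's built-in Welch t-test. The sketch below uses simulated growth rates (the report's `cd_analysis_simple` object is not reproduced here), so its output only illustrates the mechanics behind Table 6:

```r
# Simulated CD-level price growth for two terciles (hypothetical values,
# loosely matching the observed Low ≈ 26% and High ≈ 12% means)
set.seed(1)
toy <- data.frame(
  price_change_pct = c(rnorm(20, mean = 26, sd = 9),
                       rnorm(20, mean = 12, sd = 9)),
  edu_tercile = rep(c("Low", "High"), each = 20)
)

# Welch two-sample t-test; grouping levels sort alphabetically, so the
# estimated difference is High minus Low, as in Table 6
tt <- t.test(price_change_pct ~ edu_tercile, data = toy)
tt$estimate  # group means (High first, then Low)
tt$p.value
```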
Figure 4 below shows the profound magnitude of this reversal.
Figure 4. Property Value Growth by Education Tercile. Mean percentage change in median sale price with 95% confidence intervals. Error bars represent uncertainty in the average CD-level price change within each tercile.
Non-overlapping confidence intervals confirm distinct economic outcomes between the High and Low EA groups, indicating the 14.2 pp divergence represents structural shifts rather than anomaly.
Parametric Regression Analysis
While terciles demonstrate the magnitude of the reversal, a continuous regression quantifies how each incremental BA+ percentage point relates to post-COVID appreciation.

As shown in Table 7, each additional pp of BA+ Attainment predicts 0.380 pp less price growth. This statistically significant (p < 0.001) negative coefficient directly contradicts the pre-COVID pattern, where higher EA predicted higher prices. The relationship has not only weakened; it has reversed.
```{r}
#| label: main-did-regression
#| echo: false

# STEP 1: Estimate post-COVID regression
# Model: Price Change (%) = β₀ + β₁(BA+ %) + ε
simple_lm <- lm(
  price_change_pct ~ pct_ba_plus_2019,
  data = cd_analysis_simple
)

# Extract coefficients for use in narrative text
post_slope <- coef(simple_lm)[2]
post_intercept <- coef(simple_lm)[1]

# STEP 2: Create clean regression table using broom package
regression_table <- tidy(simple_lm, conf.int = TRUE) |>
  mutate(
    # Replace generic term names with descriptive labels
    term = c("Intercept", "BA+ Attainment (%)"),
    # Format numeric columns for display
    across(c(estimate, std.error, conf.low, conf.high), ~round(.x, 3)),
    # Round test statistics
    statistic = round(statistic, 2),
    # Format p-values (show "< 0.001" for very small values)
    p.value = ifelse(p.value < 0.001, "< 0.001", round(p.value, 4))
  ) |>
  select(
    Term = term,
    Coefficient = estimate,
    `Std. Error` = std.error,
    `95% CI Lower` = conf.low,
    `95% CI Upper` = conf.high,
    `t-statistic` = statistic,
    `p-value` = p.value
  )

# Display regression coefficients table
kable(
  regression_table,
  caption = "**Table 7. Post-COVID Regression: Price Growth on Educational Attainment**",
  align = c("l", rep("r", 6))
)
```
Table 7. Post-COVID Regression: Price Growth on Educational Attainment

| Term               | Coefficient | Std. Error | 95% CI Lower | 95% CI Upper | t-statistic | p-value |
|--------------------|------------:|-----------:|-------------:|-------------:|------------:|--------:|
| Intercept          | 32.006      | 3.340      | 25.317       | 38.695       | 9.58        | < 0.001 |
| BA+ Attainment (%) | -0.380      | 0.078      | -0.537       | -0.223       | -4.84       | < 0.001 |
Comparing pre- and post-COVID models reveals this shift’s extent.
In the pre-COVID period, higher education predicted higher absolute prices, with each pp of BA+ Attainment adding nearly $14,000 to median home values.
Moreover, the post-COVID regression yields an \(R^2\) of 0.291, indicating that baseline EA alone explains 29.1% of the variation in price appreciation. Although lower than the pre-COVID model's fit, this is acceptable explanatory power for a growth metric, confirming that EA remained a primary, yet inverted, driver of market disparity during the pandemic.
```{r}
# STEP 3: Create regression_glance object for model fit statistics
regression_glance <- glance(simple_lm)

# STEP 4: Extract and display model fit statistics
model_fit <- regression_glance |>
  transmute(
    `R²` = round(r.squared, 3),
    `Adjusted R²` = round(adj.r.squared, 3),
    `Residual SE` = paste0(round(sigma, 2), " pp"),
    `F-statistic` = round(statistic, 2),
    `p-value` = "< 0.001"
  )

kable(
  model_fit,
  caption = "**Table 8. Model Fit Statistics**",
  align = rep("r", 5)
)
```
Table 8. Model Fit Statistics

| R²    | Adjusted R² | Residual SE | F-statistic | p-value |
|------:|------------:|------------:|------------:|--------:|
| 0.291 | 0.279       | 11.89 pp    | 23.42       | < 0.001 |
Internal Consistency Check
The tercile-based and regression-based approaches should yield consistent estimates if the education-growth relationship is approximately linear.
From Table 7, each additional pp of BA+ Attainment predicts a 0.380 pp reduction in price growth, and the average education gap between the High and Low terciles is 40.4 pp.

Multiplying the coefficient by this gap gives the expected difference: \(40.4 \times (-0.380) \approx -15.3\) pp. That is, the regression model predicts that High-EA CDs should grow roughly 15.3 pp less than Low-EA CDs.

From Table 5, we observed that Low-EA CDs actually grew 14.2 pp more than High-EA CDs (26.03% − 11.86% ≈ 14.2 pp).
Table 9 compares these two estimates to assess internal consistency.
```{r}
#| label: main-did-consistency-check
#| echo: false

# Part 5: Internal consistency check
edu_coef <- coef(simple_lm)[2]

edu_gap <- cd_analysis_simple |>
  group_by(edu_tercile) |>
  summarise(avg_edu = mean(pct_ba_plus_2019, na.rm = TRUE), .groups = "drop") |>
  filter(edu_tercile %in% c("Low", "High")) |>
  summarise(gap = avg_edu[edu_tercile == "High"] - avg_edu[edu_tercile == "Low"]) |>
  pull(gap)

predicted_did <- edu_gap * edu_coef
observed_did <- did_estimate_raw

consistency_table <- tibble(
  Quantity = c(
    "Average education gap (High − Low)",
    "Regression coefficient (pp per 1% BA+)",
    "Predicted DiD from regression",
    "Observed DiD from tercile table"
  ),
  Value = c(
    paste0(round(edu_gap, 1), " pp"),
    round(edu_coef, 3),
    paste0(round(predicted_did, 1), " pp"),
    paste0(round(observed_did, 1), " pp")
  )
)

kable(
  consistency_table,
  caption = "**Table 9. Internal Consistency Check: Tercile DiD vs. Continuous Regression**",
  align = c("l", "r")
)
```
Table 9. Internal Consistency Check: Tercile DiD vs. Continuous Regression

| Quantity                               | Value    |
|----------------------------------------|---------:|
| Average education gap (High − Low)     | 40.4 pp  |
| Regression coefficient (pp per 1% BA+) | -0.38    |
| Predicted DiD from regression          | -15.3 pp |
| Observed DiD from tercile table        | 14.2 pp  |
The regression-based prediction (−15.3 pp, from High's perspective) closely matches the observed tercile difference (+14.2 pp, from Low's perspective), a relative discrepancy of under 8%. This close correspondence confirms internal consistency between the non-parametric (tercile) and parametric (regression) approaches.
Whether comparing discrete education groups or modeling continuous relationships, the conclusion remains the same: Low-EA CDs experienced substantially faster price growth during the post-COVID period, with the magnitude of this reversal measuring approximately 14-15 pp.
Robustness: Citywide Validation by Borough
The previous sections highlighted stark differences in EA and property values, drawing examples from two boroughs. However, it is essential to examine whether this inverted education–growth relationship holds across all five boroughs, as each contains a different demographic composition and a unique pandemic experience.
Table 10. Distribution of CD-Level Price Growth by Borough

| Borough       | CDs | Mean Change | SD    | Min    | Max   |
|---------------|----:|------------:|------:|-------:|------:|
| Manhattan     | 12  | 6.4%        | 15.9% | -16.1% | 45.7% |
| Bronx         | 12  | 31.7%       | 11.7% | 4.8%   | 51.5% |
| Staten Island | 3   | 22.6%       | 1.5%  | 21.6%  | 24.3% |
| Queens        | 14  | 16.9%       | 11.7% | 2.0%   | 38.2% |
| Brooklyn      | 18  | 15.7%       | 8.6%  | 1.1%   | 39.4% |
Table 10 reveals that pandemic-era property value growth occurred across all five boroughs, though rates varied significantly between them. Manhattan exhibited the highest volatility (\(SD = 15.9\%\)); the Bronx and Queens showed identical spread (\(SD = 11.7\%\) each), while Brooklyn remained relatively more consistent (\(SD = 8.6\%\)).
Table 11 quantifies how EA impacts property values within each borough.
Table 11. Education–Growth Relationship by Borough

| Borough       | CDs | Correlation (r) | Slope (pp per 1% BA+) |
|---------------|----:|----------------:|----------------------:|
| Staten Island | 3   | -0.991          | -0.682                |
| Bronx         | 12  | -0.346          | -0.476                |
| Manhattan     | 12  | -0.389          | -0.298                |
| Queens        | 14  | -0.221          | -0.230                |
| Brooklyn      | 18  | -0.009          | -0.005                |
There is a consistent negative relationship between EA and price growth across all five boroughs. All boroughs show negative correlations and negative slopes, confirming the pattern extends beyond any single market.
Staten Island (SI), with its three CDs, exhibits the strongest negative relationship (r = −0.991, slope = −0.682), though the small sample warrants caution.
Brooklyn's near-zero slope (−0.005) and weak correlation (r = −0.009) suggest that other factors (e.g., gentrification) drove appreciation more than EA. Additionally, its proximity to Manhattan may have outweighed EA in determining appreciation. Consequently, Brooklyn's distinctive outcome calls for deeper investigation in future research.
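The `borough_slopes` object plotted in Figure 5 is constructed off-screen. A self-contained sketch of the per-borough slope computation, using simulated data and the column names assumed elsewhere in this report, might look like:

```r
library(dplyr)

# Simulated stand-in for cd_analysis_simple (two boroughs, hypothetical data)
set.seed(42)
toy <- data.frame(
  borough = rep(c("Bronx", "Brooklyn"), each = 12),
  pct_ba_plus_2019 = runif(24, 10, 80)
)
toy$price_change_pct <- 32 - 0.38 * toy$pct_ba_plus_2019 + rnorm(24, sd = 5)

# One education-growth slope per borough:
# regress price growth on BA+ share within each group
borough_slopes <- toy |>
  group_by(borough) |>
  summarise(
    slope = coef(lm(price_change_pct ~ pct_ba_plus_2019))[2],
    .groups = "drop"
  )

borough_slopes
```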
Figure 5 visualizes these borough-specific slopes for direct comparison.
```{r}
#| label: robustness-slope-plot
#| echo: false
#| fig-width: 7
#| fig-height: 4
#| fig-cap: "**Figure 5. Education–Growth Relationship Varies by Borough.** All five boroughs show negative slopes (higher education → lower price growth), with Staten Island showing the strongest effect."

ggplot(
  borough_slopes,
  aes(x = reorder(borough, slope), y = slope)
) +
  geom_col(fill = "#7570b3", width = 0.6, alpha = 0.8) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "black") +
  labs(
    title = "Education–Growth Relationship Varies by Borough",
    subtitle = "Slope: Percentage points of price growth per 1pp increase in BA+ attainment",
    x = "Borough",
    y = "Slope (pp per 1% BA+)"
  ) +
  coord_flip() +
  theme_minimal(base_size = 13) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 11),
    panel.grid.major.y = element_blank()
  )
```
This figure confirms that no borough experienced the traditional positive education–growth relationship during the post-COVID period. In four of five boroughs, there was a moderate-to-strong negative relationship, with slopes ranging from −0.230 to −0.682.
Discussion
Interpretation
Several factors likely explain this borough-wide education premium reversal.
These factors likely worked together as a systemic feedback loop. As the agglomeration premium of high-EA CDs diminished, market demand shifted toward peripheral, outer-borough markets. This demand surge created a compounding effect, accelerating appreciation in Low-EA districts while High-EA growth stagnated.
It is essential to clarify that this analysis focuses on highlighting a change in growth rather than hierarchy. High-EA neighborhoods maintained their absolute price dominance throughout the pandemic. The reversal occurred primarily in appreciation rates. However, it remains unclear whether these findings reveal a permanent value change or a temporary disruption.
Contribution to Overarching Question
This analysis demonstrates that COVID-19 reshaped the relationship between neighborhood characteristics and property values by reversing the education premium, which was historically a strong predictor of urban real estate prices. Moreover, this finding connects with the team’s individual analyses of other neighborhood characteristics. This research:
Aligns with transit findings: Both accessibility and education premiums weakened.
Contextualizes density analysis: The apparent density penalty was likely driven by education rather than density itself.
Contrasts with job accessibility: Job accessibility remained stable while high-EA CDs lost their premium and low-EA CDs gained value, underscoring the education premium reversal.
Provides baseline context for crime results: Heterogeneous effects by initial crime conditions mirror our tercile patterns.
The education reversal appears to be the dominant structural shift, with other characteristics showing stability (jobs) or modest weakening (density and transit). This suggests pandemic housing dynamics re-calibrated a longstanding urban economic geography.
Limitations and Conclusion
While this analysis provides clear evidence of a post-pandemic reversal, several limitations qualify the results. First, CD-level data may obscure location-specific variation, such as gentrification within Low-EA areas. Second, residents of High-EA CDs may have held onto their properties; slow growth could therefore reflect constrained inventory rather than weakened demand, masking retention premiums. Finally, because the data extend only through 2023, it is unclear whether these findings reflect temporary disruptions or permanent shifts, particularly given recent employer mandates requiring employees to return to physical workplaces.
Despite these limitations, this research reveals that the pandemic reversed the relationship between neighborhood EA and property value growth in NYC. COVID-19 not only disrupted NYC's property value logic, it reshaped buyer preferences. Consequently, affordability pressures created a new urban landscape in which traditional high-EA clusters, once a primary driver of property value, matter less as demand shifts toward the value offered by outer-borough CDs.